CODESYS String Libraries

Introduction

The libraries in the CODESYS String Libraries package process UTF-8 encoded strings. They are based on the IString interface from the String Segments library, which allows strings to be passed to the respective functions by reference. To create an IString instance, you can use, for example, the GSB.UTF8String function block from the Generic String Base library.

Table 1. The following libraries are supplied with the package:

Library	Description	Documentation
`String Segments`	Base functions for `IString` instances	String Segments Library Documentation
`String Builder`	Efficient management of UTF-8 encoded string segments	String Builder Library Documentation
`String Conversions`	Conversion of strings between other encodings and UTF-8	String Conversions Library Documentation
`String Functions`	Functions for processing UTF-8 encoded strings, modeled on the conventional standard library	String Functions Library Documentation
`Unicode Support`	Functions for processing Unicode character categories	Unicode Support Functions Library Documentation
`UTF-16 Encoding Support`	Base functions for handling UTF-16 encoded memory areas	UTF-16 Encoding Support Library Documentation
`UTF-8 Encoding Support`	Base functions for handling UTF-8 encoded memory areas	UTF-8 Encoding Support Functions Library Documentation
`Generic String Base`	Function blocks for processing UTF-8 encoded strings which manage their memory statically via `GENERIC CONSTANT`	Generic String Base Functions Library Documentation

Advantages of the new string libraries

Important

The new string libraries do not replace the old familiar string functions of the Standard and Standard64 libraries. Nevertheless, we recommend using the new string libraries for new projects.

The new string libraries also handle large strings efficiently. The memory of a string can be up to 16#FFFF_FF00 bytes in size, so the length is practically unlimited. For this reason, the libraries are also suitable for editing large text files and web content.

Other advantages:

UTF-8 is an encoding that can represent the full range of Unicode characters.
UTF-8 is widely used on the Internet and is recommended by the World Wide Web Consortium (W3C).
UTF-8 is compatible with legacy systems because every ASCII text is also valid UTF-8.
UTF-8 provides a high level of interoperability.
UTF-8 is memory-efficient: ASCII characters take one byte, and all other characters take two to four bytes.

The new string libraries let you query a previously defined string through its methods, as you may know from other high-level languages.

Example 1. Example string method: Len()

udiStringLen := myString.Len();
IF udiStringLen = 22 THEN
    ...
END_IF

As of CODESYS 3.5.18.0, you can set the compiler to interpret the contents of variables of type STRING as UTF-8 encoded. To do this, select the UTF8 Encoding for STRING option in the Project Settings in the Compile options category.

If you do not want to treat all STRING variables in a project as UTF-8 encoded, then you need to clear this option. After that, you can apply UTF-8 encoding to individual literals of the STRING type on a case-by-case basis.

Example 2. UTF-8 encoding for literals

{attribute 'monitoring_encoding' := 'UTF-8'}
sValue : STRING(140) := UTF8#'Ðα ṧтℯ♄ ḯḉℌ ηuη, i¢ℌ αямℯґ 𝕋øґ‼ Ṳᾔⅾ ♭ḯη ＄☺ ḱℓυℊ αł＄ ωⅈ℮ ẕυ√◎ґ';

Thanks to the capabilities of UTF-8 encoding, you do not have to use the WSTRING data type in CODESYS to use an extended character set. UCS-2 encoding, which WSTRING is based on, may require more memory than UTF-8 encoding, depending on the application. UCS-2 encoding always uses one WORD per character and can represent only the characters U+0000 to U+D7FF and U+E000 to U+FFFD. UTF-8 encoding requires between one and four bytes per character and can therefore represent all Unicode characters.
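The byte counts mentioned above can be sketched as a small function. This is only an illustration in plain Structured Text and not part of the string libraries; the name Utf8ByteCount and its parameter are hypothetical.

```iecst
// Hypothetical helper, not a library function:
// returns the number of bytes that UTF-8 needs for one code point.
FUNCTION Utf8ByteCount : UDINT
VAR_INPUT
    diCodePoint : DINT; // Unicode code point, for example 16#75 for 'u'
END_VAR

IF (diCodePoint < 0) OR (diCodePoint > 16#10FFFF) THEN
    Utf8ByteCount := 0; // outside the Unicode range
ELSIF (diCodePoint >= 16#D800) AND (diCodePoint <= 16#DFFF) THEN
    Utf8ByteCount := 0; // surrogate code points are not encoded in UTF-8
ELSIF diCodePoint < 16#80 THEN
    Utf8ByteCount := 1; // ASCII range; UCS-2 needs 2 bytes here
ELSIF diCodePoint < 16#800 THEN
    Utf8ByteCount := 2;
ELSIF diCodePoint < 16#10000 THEN
    Utf8ByteCount := 3; // UCS-2 needs 2 bytes here
ELSE
    Utf8ByteCount := 4; // beyond U+FFFF; not representable in UCS-2
END_IF
```

For example, Utf8ByteCount(16#75) yields 1 for 'u', and Utf8ByteCount(16#2318) yields 3 for '⌘'. Text that consists mostly of ASCII characters therefore takes about half the memory of the same text in UCS-2.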

Because UTF-8 characters vary in length, the byte index of a character generally differs from its character index. If you access a UTF-8 encoded string with a character index, then you get unexpected results.

Example 3. Encoding of variable length

{attribute 'monitoring_encoding' := 'UTF-8'}
sValue : STRING(140) := UTF8#'Ðα ṧтℯ♄ ḯḉℌ ηuη, i¢ℌ αямℯґ 𝕋øґ‼ Ṳᾔⅾ ♭ḯη ＄☺ ḱℓυℊ αł＄ ωⅈ℮ ẕυ√◎ґ';

byValue := sValue[13]; // Byte 13 is NOT 'u', even though 'u' appears at character index 13
xOk := byValue <> 16#75;

To determine the byte index of a character, you need to iterate through the string rune by rune.

Example 4. Iteration over UTF-8 encoded strings

VAR	
    {attribute 'monitoring_encoding' := 'UTF-8'}
    sValue : STRING(140) := UTF8#'Ðα ṧтℯ♄ ḯḉℌ ηuη, i¢ℌ αямℯґ 𝕋øґ‼ Ṳᾔⅾ ♭ḯη ＄☺ ḱℓυℊ αł＄ ωⅈ℮ ẕυ√◎ґ';

    fbsValue : STR.UTF8Literal := (psValue:=ADR(sValue)); 
    fbRange : STR.Range := (itfString:=fbsValue);
    diRune : STR.RUNE;
    udiIndex, udiLength : UDINT;
    xOk : BOOL;
END_VAR
 
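// Read rune by rune until the end of the string (rune value 0)
WHILE (diRune := fbRange.GetNextRune(udiLength=>udiLength)) <> 0 DO
    IF diRune = 16#75 (* 'u' *) THEN
        EXIT; // udiIndex now holds the byte index of 'u'
    END_IF
    // Advance the byte index by the encoded length of the rune just read
    udiIndex := udiIndex + udiLength;
END_WHILE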
 
xOk := sValue[udiIndex] = 16#75 (* 'u' *);
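To show what such an iteration does at the byte level, the following sketch finds the byte offset of a character without the string libraries. It is an illustration only: the function name and parameters are hypothetical, and it assumes a valid, null-terminated UTF-8 string. It relies on the fact that continuation bytes in UTF-8 always have the bit pattern 2#10xxxxxx.

```iecst
// Hypothetical helper, not a library function:
// returns the byte offset of the character with the given zero-based index,
// or -1 if the string has fewer characters.
FUNCTION Utf8ByteOffsetOfChar : DINT
VAR_INPUT
    pbyText : POINTER TO BYTE; // first byte of a null-terminated UTF-8 string
    udiCharIndex : UDINT;      // zero-based character index
END_VAR
VAR
    udiPos : UDINT;   // current byte offset
    udiChars : UDINT; // number of characters started so far
END_VAR

udiPos := 0;
udiChars := 0;
Utf8ByteOffsetOfChar := -1;

WHILE pbyText[udiPos] <> 0 DO
    // Every byte that is not a continuation byte starts a new character
    IF (pbyText[udiPos] AND 16#C0) <> 16#80 THEN
        IF udiChars = udiCharIndex THEN
            Utf8ByteOffsetOfChar := UDINT_TO_DINT(udiPos);
            RETURN;
        END_IF
        udiChars := udiChars + 1;
    END_IF
    udiPos := udiPos + 1;
END_WHILE
```

A call such as Utf8ByteOffsetOfChar(ADR(sValue), 13) returns the byte index at which the character with index 13 begins. In an application, the STR.Range iteration shown above is preferable because it works on the IString interface and also reports the length of each rune.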

Disadvantages of the established `STRING` functions

In the previously established STRING functions from the standard library, the parameters of type STRING are copied when they are passed to the functions. The return value is also copied to a variable with the assignment.

Example 5. Problems with the established STRING functions

VAR
    sValue : STRING;
END_VAR
 
sValue := CONCAT(CONCAT(CONCAT('Da steh ich nun,', ' ich armer Tor!'), ' Und bin so'), ' klug als wie zu vor');
//                              -> Copy, LEN       -> Copy, LEN        -> Copy, LEN    -> Copy, LEN
//        -> 2xCopy, LEN
//               -> 2xCopy, LEN
//                      -> 2xCopy, LEN

Before the respective functions can process parameters of type STRING, they often have to determine the length of each parameter by iterating up to the terminating null character. For longer strings, these copy and iteration operations increase the processing time of the application. In addition, these functions limit the length of the strings to 255 characters.
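The cost of determining the length can be pictured with the following sketch. It is not the implementation of the standard library; the function name is hypothetical. It only illustrates that the work grows with the length of the string and is repeated on every call.

```iecst
// Hypothetical helper for illustration only:
// counts the bytes up to the terminating null character.
FUNCTION ScanLength : UDINT
VAR_INPUT
    pbyText : POINTER TO BYTE; // first byte of a null-terminated string
END_VAR
VAR
    udiPos : UDINT;
END_VAR

udiPos := 0;
WHILE pbyText[udiPos] <> 0 DO
    udiPos := udiPos + 1; // one step per byte of the string
END_WHILE
ScanLength := udiPos;
```

In the nested CONCAT call of Example 5, a scan of this kind is needed for each parameter of each call, in addition to the copy operations.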

Using the `IString` interface

The STR.IString interface was introduced so that the data structure which manages the information about a string can be passed by reference. This is a major difference from the previously established STRING functions, which do not implement the STR.IString interface.

Furthermore, the size of a string (the memory for the UTF-8 encoded characters) can be any UDINT value in the range 4 ≤ udiSize ≤ 16#FFFF_FF00.

In this data structure, the following information is kept up to date and does not have to be recalculated before each processing step:

Reference to the respective memory segment
Current capacity (→ GetSegment)
Length (→ Len) in bytes
Number of characters (→ RuneCount)

Example 6. Properties of STR.IString

VAR
    itfString : STR.IString;
    udiLength, udiSize, udiRuneCount : UDINT;
    pbySegment : POINTER TO BYTE;
    xValid : BOOL;
END_VAR
 
udiLength := itfString.Len(); // Current length in bytes
pbySegment := itfString.GetSegment(udiSize=>udiSize); // Address of the first byte, capacity of the segment in bytes
udiRuneCount := STR.RuneCount(itfString); // Current number of "characters" in the segment
xValid := itfString.IsValid(); // Indicates whether the content is valid UTF-8

Correlation: "character" and "rune"

The term "rune" appears in the libraries and in the source code. It means exactly the same as "Unicode code point", with one addition.

The libraries define the word "rune" as an alias for the type DINT. As a result, the user can clearly see when an integer value represents a code point. Moreover, what can be thought of as a character constant is called a rune constant.

Example: The type and value of the expression WSTRING#"⌘" is a rune with the integer value DINT#16#2318.
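As a minimal sketch of how a rune behaves like an integer, the following declaration uses the STR.RUNE type from Example 4. The variable names are hypothetical.

```iecst
VAR
    diCommand : STR.RUNE := 16#2318; // code point U+2318, the character '⌘'
    diPlain : DINT;
    xIsCommand : BOOL;
END_VAR

diPlain := diCommand;              // no conversion needed, a rune is an alias for DINT
xIsCommand := (diPlain = 16#2318); // compares code points as plain integers
```

Declaring the variable as STR.RUNE instead of DINT changes nothing for the compiler, but it documents that the value is meant as a code point.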